High performance Chinese OCR based on Gabor features, discriminative feature extraction and model training

Authors

  • Qiang Huo
  • Yong Ge
  • Zhi-Dan Feng
Abstract

We have been developing a Chinese OCR engine for machine-printed documents. Currently, the engine supports a vocabulary of 6921 characters: 6707 simplified Chinese characters in GB2312-80, 12 frequently used GBK Chinese characters, 62 alphanumeric characters, and 140 punctuation marks and symbols. The supported font styles include Song, Fang Song, Kai, He, Yuan, LiShu, WeiBei, XingKai, etc. The average character recognition accuracy is above 99% for newspaper-quality documents, with a recognition speed of about 250 characters per second on a Pentium III 450 MHz PC and a memory footprint of less than 2 MB. In this paper, we describe the key technologies used to construct this recognizer. Among them, we highlight three techniques that contribute to the high recognition accuracy: the use of Gabor features, the use of discriminative feature extraction, and the use of minimum classification error as the criterion for model training.
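As a rough illustration of what "Gabor features" can mean for a character image, the sketch below builds the real part of a standard 2-D Gabor kernel and samples the magnitude of its response on a coarse grid for several orientations. This is not the authors' implementation: the function names, the kernel size, wavelength, sigma, number of orientations, and the sampling grid are all assumptions chosen only for illustration, and the abstract does not specify any of them.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    """Real (cosine) part of a 2-D Gabor kernel; size is assumed to be odd."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma * sigma))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def gabor_response(image, kernel, cy, cx):
    """Filter response centred on pixel (cy, cx); pixels outside the image count as 0."""
    half = len(kernel) // 2
    total = 0.0
    for ky, row in enumerate(kernel):
        for kx, weight in enumerate(row):
            y, x = cy + ky - half, cx + kx - half
            if 0 <= y < len(image) and 0 <= x < len(image[0]):
                total += weight * image[y][x]
    return total

def gabor_features(image, orientations=4, grid=4, size=7, wavelength=4.0, sigma=2.0):
    """Hypothetical feature vector: |response| on a grid x grid lattice per orientation."""
    height, width = len(image), len(image[0])
    features = []
    for k in range(orientations):
        kernel = gabor_kernel(size, wavelength, k * math.pi / orientations, sigma)
        for gy in range(grid):
            for gx in range(grid):
                # Centre of each grid cell in pixel coordinates.
                cy = (2 * gy + 1) * height // (2 * grid)
                cx = (2 * gx + 1) * width // (2 * grid)
                features.append(abs(gabor_response(image, kernel, cy, cx)))
    return features
```

With these assumed defaults, a 16 x 16 binary image yields a 64-dimensional vector (4 orientations times 16 sampling points). A kernel with theta = 0 responds most strongly to vertical strokes, which is the kind of stroke-direction sensitivity that makes Gabor filters attractive for character recognition.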

Similar resources

Phishing website detection using weighted feature line embedding

The aim of phishing is to obtain users' private information without their permission by designing a new website that mimics a trusted website. Information technology specialists do not agree on a unique definition of the discriminative features that characterize phishing websites. Therefore, the number of reliable training samples in phishing detection problems is limited. M...

A discriminative linear regression approach to adaptation of multi-prototype based classifiers and its applications for Chinese OCR

This paper presents a new discriminative linear regression approach to adaptation of a discriminatively trained prototype-based classifier for Chinese OCR. A so-called sample separation margin based minimum classification error criterion is used in both classifier training and adaptation, while an Rprop algorithm is used for optimizing the objective function. Formulations for both model-space a...

Feature Extraction and Efficiency Comparison Using Dimension Reduction Methods in Sentiment Analysis Context

Nowadays, users can share their ideas and opinions with widespread access to the Internet and especially social networks. On the other hand, the analysis of people's feelings and ideas can play a significant role in the decision making of organizations and producers. Hence, sentiment analysis or opinion mining is an important field in natural language processing. One of the most common ways to ...

3D Gabor Based Hyperspectral Anomaly Detection

Hyperspectral anomaly detection is one of the main challenging topics in both military and civilian fields. The spectral information contained in a hyperspectral cube provides a high ability for anomaly detection. In addition, the costly spatial information of adjacent pixels such as texture can also improve the discrimination between anomalous targets and background. Most studies miss the wort...

Gabor-Based Kernel Partial-Least-Squares Discrimination Features for Face Recognition

The paper presents a novel method for the extraction of facial features based on the Gabor-wavelet representation of face images and the kernel partial-least-squares discrimination (KPLSD) algorithm. The proposed feature-extraction method, called the Gabor-based kernel partial-least-squares discrimination (GKPLSD), is performed in two consecutive steps. In the first step a set of forty Gabor wa...

Publication date: 2001